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Abstract: While multiple testing procedures have been the focus of much 
statistical research, an important facet of the problem is how to deal with 
possible confounding. Procedures have been developed by authors in genetics 
and statistics. In this chapter, we relate these proposals. We propose two new 
multiple testing approaches within this framework. The first combines sensi- 
tivity analysis methods with false discovery rate estimation procedures. The 
second involves construction of shrinkage estimators that utilize the mixture 
model for multiple testing. The procedures are illustrated with applications to 
a gene expression profiling experiment in prostate cancer. 



1. Introduction 

In the study of complex diseases such as cancer, investigators have sought to impH- 
cate candidate gene polymorphisms with disease. Candidate gene polymorphisms 
have been typically identified in case-control studies, where their frequencies in 
cases and controls are typically compared. Such studies are quite commonplace in 
the scientific literature. 

For this context, we focus on the problem of multiple testing. A useful quantity to 
consider in multiple testing situations is the false discovery rate (FDR) (Benjamini 
and Hochberg [2]). Sabatti et al. [24] and Abecasis et al. [1] have advocated its use 
in genetic association studies. There has been a lot of work on statistical procedures 
involving the false discovery rate, which we review in Section 2. While much work 
has been done on multiple testing procedures, very little work to date has been 
done on estimation for this setting. 

Recently, Efron [11] has argued that these methods presume theoretical null dis- 
tributions, which might be incorrect, and has instead argued for use of an empirical 
null distribution. A parallel development of this problem has occurred in genetic 
association studies (Cardon and Bell [7]), in which statistical geneticists look at 
frequencies of alleles in affected (case) and unaffected (control) populations to de- 
termine associations between genes and disease. Wright [.30] argued that a variety 
of population genetic forces, such as nonrandom mating and genetic drift, might 
induce genetically similar subgroups within populations. This is referred to as pop- 
ulation structure in the genetic literature. Proposals for addressing this problem 
have been developed by Devlin and Roeder [10] and Rosenberg and Pritchard [23]. 
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In this article, we seek to unify these proposals in a framework for multiple 
testing in which we adjust for confounding. The population structure that must 
be accounted for in genetic association studies represents one type of confound- 
ing. There has also been mention on the role of other confounders in differential 
expression analyses for microarray data (Bhattacharya ct al. [(i] and Ghosh and 
Chinnaiyan [17]). Confounding is a problem in that it biases the associations that 
arc observed in the data. In other words, because of confounding, under the null 
hypothesis of no association, the usual test statistic does not have the correct mean. 

The structure of this paper is as follows. In Section 2, we define the false dis- 
covery rate and review the standard mixture model for multiple hypothesis testing 
that has been used in the literature. We then slightly generalize the model to al- 
low for confounding and then relate three proposals for multiple testing (Devlin 
and Roeder [10], Pritchard and Rosenberg [2-S] and Efron [11]) to this model. We 
then propose two analytical schemes for multiple testing in the presence of con- 
founding. The first involves utilization of sensitivity analysis methods to adjust for 
confounding (Lin et al. [19]), followed by q-value estimation (Storey [2S]) or a false 
discovery rate controlling procedure. This, along with the link to related shrinkage 
estimation procedures (James and Stein [IS], Sen and Saleh [2(), 27] and George 
[15]), is discussed in Section 3. In Section 4, we discuss estimation of association, 
along with confidence intervals in the multiple testing situation in the presence of 
confounding. We highlight the procedures with applications to a gene expression 
data from a prostate cancer study in Section 5. Finally, we conclude with some 
discussion in Section 6. 



2. Data structures, background and preliminaries 

Suppose there are g genes that we wish to make gene-specific hypotheses about. 
We can cross-classify the corresponding g hypotheses into their true status (i.e. null 
is true or alternative is true) and also based on whether or not we decide to reject 
the null hypothesis. Such a table is given in Table 1. 

The classical quantity that has been controlled in multiple testing problems has 
been the familywise type I error rate (FWER). Based on Table 1, the FWER is 
defined as P{V > 1). By contrast, the definition of false discovery rate (FDR) as 
put forward by Benjamini and Hochbcrg [2] is 



FDR = E 



V 

^\Q>0 



P{Q > 0). 



The conditioning on the event [Q > 0] is needed because the fraction V/Q is not 
well-defined when Q = 0. If all the null hypotheses are true (i.e., go = 5)j then 
control of the false discovery also provides control of the familywise type I error 
rate. In general, however, control of the two quantities are not equivalent, and the 
control of the false discovery rate is less conservative than that of FWER. 
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Outcomes of G tests of hypotheses 





Accept 


Reject 


Total 


True null 


U 


V 


Go 


True alternative 


T 


S 


Gi 




W 


Q 


G 



Multiple comparisons 



245 



FDR-related procedures fall into two classes: (1) methods that aim to control 
FDR; (2) methods for direct estimation of FDR based on a fixed rejection region. 
The first class of procedures has been proposed by many authors, such as Benjamini 
and Hochberg [2], Benjamini and Liu [3], Benjamini and Yekutieli [4] and Sarkar 
[25]. Efron et al. [12], Storey [28, 29] and Genovese and Wasserman [K!], 2005 have 
developed the second class of procedures. The two classes of methods have been 
unified by Storey et al. [30] and Genovese and Wasserman (2005). The FDR is also 
linked intimately to the q- value (Storey [2iS]), the smallest FDR at which a null 
hypothesis can be rejected. 

2. 1 . Mixture model framework for multiple testing 

For false discovery rate estimation procedures, many authors have studied a mixture 
model for multiple testing. To test each null hypothesis, let Tg denote the test 
statistic for the gth gene. Define indicator variables Hi, ... , Hq corresponding to 
Ti, . . . , Tg, where Hi = if the null hypothesis is true and Hi = 1 ii the alternative 
hypothesis is true. Assume that Hi , . . . , Hq are a random sample from a Bernoulli 
distribution where P{Hi = 0) = ttq, i = 1, . . . , G. We define the densities /o and fi 
corresponding to Ti\Hi = and Ti\Hi ~ 1, (i = 1, . . . , m). The marginal density of 
the test statistics Ti, . . . , T^^ is given by 

(2.1) /(t) = 7ro/o(t) + (l-7ro)/i(i)- 

The mixture model framework represented in (2.1) has been used by several authors 
to study the false discovery rate (e.g., Efron et al. [12], Storey [2n] and Genovese 
and Wasserman [14]). Given an estimate of ttq, methods for false discovery rate 
estimation have been developed by several authors (Efron et al. [12] and Storey 
[28]). There are several methods for estimating ttq in the literature, most of which 
are based on the defintion of ttq as the derivative of the density of test statistics 
corresponding to the null hypothesis evaluated at one. The latter definition can 
be found in Storey and Tibshirani [31]; methods of estimation of ttq can be found 
in Storey [2S], Pounds and Cheng [22] and Dalmasso et al. [X]. Intuitively, it is 
estimated based on p- values that in a region close to one; the size of the region will 
be inevitably linked to a bias- variance tradeoff regarding the estimate of ttq. 

2.2. Adjustments for confounding 

In model (2.1), authors have usually assumed /o to be known. For example, if the 
test statistics are p- values, then /o has been assumed to be the pdf of a Uniform (0,1) 
random variable. If the statistics are Wald-type statistics (e.g., two sample t-tests), 
then /o is usually taken to be a normal distribution with mean zero and variance 
one. 

There have been certain proposals in which authors attempted to use permuta- 
tion methods to estimate /o (e.g., Efron et al. [12]). This technique has been used 
to account for correlation between the hypotheses, e.g. genes on a microarray hav- 
ing correlation. Permutation is often carried out by interchanging labels between 
independent samples. However, a crucial assumption of permutation method is that 
under the null hypothesis, all assignments of class labels to samples are exchange- 
able. In the presence of confounding, this assumption is no longer true. This is why 
Efron [1 1] writes that "permutation methods do no automatically resolve the issue 
of the theoretical versus the empirical null hypothesis." 



246 



D. Ghosh 



To account for confounding, let us assume that under the nuU hypothesis, the ith 
test statistic T,; has distribution /(^(i), where the subscript C denotes confounding. 
Assume that under the alternative hypothesis, the distribution is for T!;, 

i = 1, . . . , n. Then we have the following mixture model: 

(2.2) T,'^'7:ofg{t) + {l-^,)fg{t). 

Assume further that /{^(f) does not depend on i. Suppose there were supplementary 
test statistics Ui, . . . , Uk available for which it were known that the null hypothesis 
was true. The framework of Devlin and Roeder [10] works in the following manner. 
Using a population genetic model and the presence of population stratification, 
they show that /^(t) will be relatively overdispersed compared to /oi(t), which they 
assume not to depend on i. To be specific, /^(i) = a^^ fo{t/ai). While there should 
also be a mean shift (i.e. bias correction) in fg relative to /o, Devlin and Roeder 
[10] argue that the overdispersion outweighs the bias; this has been supported by 
other authors using simulation studies (Wacholdcr et al. [-34]). Making a further 
assumption that the scale parameter ai is constant across the genome so that it no 
longer depends on i, Devhn and Roeder [10] use the statistics Ui, . . . , Uk to estimate 
a. Let the resulting estimator be denoted as a; one can then use in a^^ foit/a) as 
the reference null hypothesis for multiple testing. Devlin and Roeder [10] argue 
for the constant variance assumption across the genome using population genetic 
arguments. 

An alternative approach to adjusting for confounding was put forward by 
Pritchard and Rosenberg [2:!]. They postulated the existence of a latent variable C 
such that conditional on C = c, 

(2.3) T,\C = c TTofoit) + (1 - 7ro)/i(t), 

i.e. within subpopulations defined by C, the test statistics have the distribution 
defined as in (2.1). In their method, what is needed are supplementary data that 
can be used to define C for each individual. Just as in the genomic control method 
of Devlin and Roeder [10], these are data on genes for which the null hypothesis is 
known to be true. The approach of Pritchard and Rosenberg [23] is to infer sub- 
population membership (C) using the supplementary data and then do standard 
multiple testing within each subpopulation. Conditional on C (i.e., within subpopu- 
lations defined by C), the theoretical null and alternative distributions can be used 
for the multiple testing problem. For Pritchard and Rosenberg [23], they use the 
supplemental data to infer subpopulation membership and then for those defined by 
common values for the inferred cluster membership, they can perform the standard 
multiple testing procedures. 

The approach of Efron [11] involves the following model: 

(2.4) Tr^7T,fl^{t) + {l-7To)f^{t). 

Thus, the density of the test statistics corresponding to the null and alternative 
hypotheses are left unspecified. No relationship between /f (t) to fo{t) is formulated, 
making this approach more flexible than that of Devlin and Roeder [10]. In addition, 
no supplemental data are required on known true null hypotheses, in contrast to 
the Devlin and Roeder [10] and Pritchard and Rosenberg [23] proposals. To make 
estimation in (2.4) feasible, Efron [11] makes a zero-matching assumption, namely 
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that all statistics near zero come from the null hypothesis component and relates 
/i to /o via an exponential tilt (Lindsey [20]). 

In comparing the proposals, we find that both the method of Pritchard and 
Rosenberg [2.)] and Devlin and Rocder [ID] both require supplemental data, as well 
as an assumption that the supplemental data are not associated with the phenotype 
in any manner. By contrast, no supplemental data are needed for the method of 
Efron [11]. However, a desirable feature of the first two proposals mentioned is that 
they explicitly model the confounding in an easily interpretable way into the model. 
The Efron approach is more algorithmic in nature, although he mentions confound- 
ing as being one reason to prefer the empirical null relative to the theoretical null 
hypothesis. 

If we think of confounding as manifesting through a bias and variance adjust- 
ment to the theoretical null distribution, then we find that the genomic control 
method makes a variance adjustment to the theoretical null, while the approach of 
Pritchard and Rosenberg [23] proposal makes an adjustment to both through the 
latent variable C. The Efron [11] procedure also does the same without specifying 
a probabilistic model for the confounding. The first approach we propose in this 
paper is to postulate confounding in a manner analogous to Pritchard and Rosen- 
berg [23]; however, the actual multiple testing procedures do not require additional 
data, which makes it similar to the approach of Efron [ i i ] . An important step in 
application of the proposed methodology is the use of sensitivity analysis methods, 
which arc described next. 

3. Sensitivity analysis and multiple testing methodologies 

To incorporate confounding into the analysis, we will utilize the framework of Lin 
et al. [19]. They cast sensitivity analysis into a regression modelling framework in 
which the confounder was treated as an unobserved covariate. While they focused 
primarily on binary and time-to-event outcomes, the data we consider in Section 5 
involves a continuous response. Also, we have the issue of multiple testing. 

3.1. Continuous response 

We observe the data (X^, Dj), i = 1, . . . , m, where is the n-dimensional covari- 
ate vector and Di the binary phenotype for the ith individual. We formulate the 
following regression model relating X and D: 



where (/3oj/3i) are regression coefficients, C represents the confounder and 7 is the 
associated regression coefficient. Note that we assume no interaction between Q 
and Di in (3.1). Since C is not observed, we fit the reduced model: 



We want to determine the relationship between /?i and /?*. Note that integrating 
(3.1) with respect to the conditional distribution of C given D implies the following 
model: 



(3.1) 



E{Xk^\D,, U) = fJok + PikD, + jC, 



(3.2) 



i?(X«|A)=/3o\+A*feA. 



(3.3) 
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We now assume C\D to be normal with mean and variance one. For this situa- 
tion, plugging in the definition and equating regression coefficients for Di in (3.3) 
and (3.2) yields 

(3.4) Pik = Plk - lil^i - Mo). 

Thus, after the user specifies 7 and the difference (/xi — /io), the estimate of (3ik, 
(3ik, can be found quite easily by direct computation into (3.4) Since the inputs 
are constants, this has no impact on the standard errors of (3ik- One can then 
use $ik/SE0li.) as the Wald statistic which when compared to a standard normal 
distribution yields a p- value. Again, the standard errors are unaffected by the sensi- 
tivity analysis parameters, and we can obtain p-values based on the Wald statistic. 



3.2. Q-value/FDR adjustment for multiple testing 

Based on the p-values calculated in the previous section, one can construct 
hypothesis-specific q- values (Storey and Tibshirani [31]). Q- values are defined to 
be the smallest FDR at which a test of hypothesis is significant, analogous to the 
p-value being the smallest level of significance at which a test of hypothesis is 
deemed to be significant. Here is the q- value construction algorithm: 

1. Order the G p-values as < p^2) < • • • < P{G)- 

2. Construct a grid of L A values, Ai, . . . , Al and calculate 



7i"o(A0 



Gil-Xi) 



1^1,. ..,L. 

3. Fit a cubic smoothing spline to the values {A;, •n-o(A;)}, I = 1, . . . , L. 

4. Estimate ttq by the interpolated value at A = 1. 

5. For the gene with the largest p-value the q- value is given by 

q{P{G)) = mm — — - = 7roP(G), 

t>P(G) mPj <t} 

and for i = G - 1, G - 2, . . . , 1, q{p(i)) = min(7ToGp(i)/i,p(j+i)). This guar- 
antees that the q-values will be monotonically increasing as a function of 
p-values. 

An alternative approach is to use the p-values in the Benjamini-Hochberg [2] 
procedure for controlling the false discovery rate: 

(a) Let < p(2) < • • • < P(G) denote the ordered, observed p-values. 

(b) Find k = max{l < k < G : p^k^ < ak/G}. 

(c) If k exists, then reject null hypotheses < ••• < Pf^j^y Otherwise, reject 
nothing. 

The major adjustment in these procedures is that we are accounting for potential 
adjustment of confounding, which impacts the p-values that get used in the multiple 
testing procedures. 

Note that while we have used q-values to adjust for the multiple testing problem, 
the p-values can be input into any single-step algorithm for multiple testing. If we 
wished to control the familywisc type I error level instead, we could use any of the 
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procedures from Westfall and Young (1993) or some of the more recent methods 
from Van dor Laan and Dudoit and colleagues (Van der Laan et al. [32, 33]). 

One thing to note about the sensitivity analysis method described in Section 3.1 
and is that it is effectively making an mean-shift adjustment to the theoretical null 
distribution. In particular, the standard errors of the estimators are not affected by 
the sensitivity analysis method of Lin et al. [1!)]. The multiple testing model that 
adjusts for confounding assumed here, in the notation of Section 2, is the following: 

(3.5) r.|e TTofoit - 6) + (1 - TTo)fi{t), 

where & denotes the sensitivity analysis inputs and arc fixed constants specified 
by the user. One advantage of such a model for confounding is that it requires 
minimal user input and is easy to implement. A second advantage is that what are 
affected in this procedure is the estimate of ttq and the actual q-values themselves. 
In settings where investigators are interested in the selection of hypotheses, these 
quantities might be all that is of interest. 

In comparing this mixture model with (2.2) and (2.3), we find that the sensitivity 
analysis approach models confounding as a shift in the mean for the confounding 
null density relative to the theoretical null density. We allow for nonconstant vari- 
ance across the genome, in contrast to the approach of Devlin and Roeder [10]. In 
addition, unlike Devlin and Roeder [10], we model the bias due to confounding. 
By comparison, the Pritchard and Rosenberg [23] and Efron [11] approaches would 
allow for both a mean and variance adjustment for the null hypothesis density. 
It appears that combining the approach here with that of genomic control should 
provide an answer that is closer to the Pritchard and Rosenberg [23] proposal, if 
supplemental data were available. 

3.3. Shrinkage estimation for the q-value 

We now heuristically describe an argument, given in more detail in Ghosh [16], that 
shows how the q-value can be motivated as a shrinkage estimator in the multiple 
testing problem. Consider model (2.1) again, where the test statistics are the p- 
values. Then one could construct shrinkage estimators of the p-values where they 
are shrunk towards the components of the mixture model defined by (2.1). If we 
assume that the density of the null hypothesis is a point mass at one and that of 
the alternative is zero, then it is shown in Ghosh [10] that a shrinkage estimator 
of the p-value in this problem is lower bounded by the q-value. Thus, the q-value 
approach advocated by Storey [28] enjoys a nice shrinkage property. However, the 
definition of the q-value given by Storey [28] is appropriate to the situation of 0/1 
loss (misclassification error). The shrinkage estimation procedure by Ghosh [10] 
could generalize easily to other loss functions, which might be more appropriate in 
other scientific contexts. 

4. Estimators of association in the presence of confounding 

A limitation of all these approaches is that they do not produce multiple-testing 
adjusted estimates of association or confidence intervals. We now discuss the prob- 
lem of how to estimate association measures in the multiple testing problem in the 
presence of confounding. While the multiple testing problem has been considered 
quite intensively in the statistical literature, that of estimating associations has not. 
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This was pointed out recently by Prentice and Qi in a genomewide study ex- 
ample. Our approach is based on the idea that we can use (2.1) to define estimation 
targets under the null and alternative hypotheses. One then can define Empirical 
Bayes estimators that shrink the observed test statistic towards each target with 
appropriate mixing weights. 

For estimation. Jamcs-Stein (James and Stein [18]) estimators arc used. We now 
construct the double shrinkage estimators using the test statistic model (2.1) where 
F = 7roi^o+(l— 7''o)-Fi- To do this, we utilize the density estimation method proposed 
by Efron [11]. Note from (2.1) that we have 

We can estimate /(t) by applying density estimation methods to Ti, . . . ,Tg- For 
estimation of TTQfo{t), the zero assumption in Efron [ . ] is utilized. What this means 
is that most test statistics with a value near zero comes from the null distribution 
component. The assumption, combined with a normal-based moments matching 
technique and density estimation procedures as described in Efron [11], allows one 
to obtain an estimate of ttq and foit). Given them, wc then obtain an estimate of 
TTi and fi{t) by simple subtraction. Based on the estimates of /o(i)j /i(0 ^'^d ttq, 
we can estimate Tf^ , . . . , T,f'^ by 



(4.1) 



JS 



where 



JS 



Oi 



rpj S 



T 



T, 




[Tt - Ao), 



Ai), 



T^kfk{t) 



7ro/o(i) + (l-^o)/i(t) 



/io = JtdFa{t) and fii = JtdFi{t). 

Before discussing the issue of confidence intervals, note that we are using the 
model framework of Efron [11] to deal with confounding, i.e., model (2.4). As men- 
tioned before, estimation in this model requires no supplemental data on known 
null hypotheses. 

Along with estimators of the parameters that adjust for multiple testing, it is 
useful to have confidence intervals that also account for the multiple testing phe- 
nomenon. To calculate the confidence intervals, we will use the following simulation- 
based algorithm: 



rrJS 



rf^JS 



1. Estimate ttq, /o and /i using the algorithm above. 

2. Construct the double shrinkage estimators oi ^i, . . . , /ic, 

3. Sample with replacement G observations fi^, . . . , from the estimated den- 
sity ^o/o(i) + (1 - ^o)fi{t) and generate Tl,, . . . , T^^^, where T^^ - N{ff^, 
1 + aj), where a J is the empirical variance of the ^*'s sampled in Step 2. 



' "! i. 



to 



Repeat Step 3 B times. Use the empirical distribution of T*^ 
calculate confidence intervals for /i^, i = 1, . . . , G. 

Multiply by the gene-specific standard deviations to get confidence intervals 
on the scale of the parameter. 
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Based on the bootstrap distributions, we can construct confidence intervals for each 
of the components of = (^i, . . . , no)- We have chosen to use equal tail confidence 
intervals, e.g. a 95% confidence interval would be based on the 2.5tli and 97.5th 
percentiles of the empirical distribution of T* . 



5. Numerical example: Microarray profiling study in prostate cancer 

We now illustrate the proposed methodologies using two genomic studies. The 
first comes from prostate cancer gene expression profiling experiment reported in 
Varambally et al. [35]. The investigators used cDNA microarrays containing 9984 
genes to profile tissue samples from various stages of prostate cancer (normal ad- 
jacent prostate, benign prostatic hyperplasia, localized prostate cancer, advanced 
metastatic prostate cancer). 

Measurements were made on 9984 genes for 101 individuals, of whom 79 have 
cancers and 22 do not. Before analyzing the data, wc filtered out genes that had 
a sample variation less than 0.05 across all samples. This left a total of G = 6253 
genes available for analysis. 

First, a q- value analysis (Storey [28]) was done using the unpooled two-sample 
t-test; this is reported in Figure 1. For this analysis, we assumed a theoretical null 
distribution of iV(0, 1) for each t-statistic. One point of note from the plot is the 
estimate of ttq, the fraction of null hypotheses estimated to be true, is 0.49. As 
a result, we are finding a large proportion of genes to be differentially expressed. 
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Fig 1. Results of the q-value analysis (Storey [is]) for prostate cancer gene expression data. 
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Table 2 

Results of sensitivity analyses for gene expression data 



7 


(/ii - Ho) 


*o 


Number q-values <0.05 








0.49 


2099 


1.5 


0.01 


0.86 


1762 




0.1 


0.47 


613 




0.3 


0.05 


6227 




0.5 


0.01 


6253 




1 


0.006 


6253 


0.1 


0.01 


0.49 


2415 




0.1 


0.78 


1902 




0.3 


0.86 


1466 




0.5 


0.80 


1154 




1 


0.69 


586 



As argued by Ghosh and Chinnaiyan [l/]. there is a variety of confounders that 
might eause the apparent observed difference between the cancer and noncanccr 
samples. The imphcation is that the standard normal distribution might not be the 
correct reference null distribution to use. We tried adjustments to the t-statistic 
based on the sensitivity analysis approach. First, we assumed a confounder that 
was continuous and tried various values of (7, //i, /xq) to obtain estimates of ttq and 
the q-values. Note that what matters is the difference in means /ii — Ho, so we 
considered changes of 0.01,0.1,0.3,0.5 and 1. We have aggregated the sensitivity 
results into Table 2. What we find is that the estimate of ttq is extremely sensitive 
to confounders. Interestingly, as the product of 7 and (/xi — fio) decreases, we find 
that the estimate of the proportion of true null hypotheses increases, which leads to 
a decrease in the number of genes being called significant if we use a cutoff of 0.05, 
say. However, we also see that there is no simple relationship between the amount of 
confounding and the estimate of ttq. The estimate of number of genes being called 
significant is also sensitive to the estimate of ttq. For example, there is very little 
difference in the estimates of ttq for the unadjusted and (7,^1 — /.to) = (0.1,0.01) 
situation, yet we are calling approximately four hundred more genes significant in 
the second scenario. This emphasizes the need to have good estimates of ttq. 

Next, we utilized the Efron [ \ I] method for estimating the local false discovery 
rates; the results are presented in Figure 2. Again, this procedure corresponds to 
a data-driven adjustment to null hypotheses for confounding. Now note that the 
estimate of ttq from the Efron [11] procedure is 0.88, almost 80% larger than the 
estimate using the method of Storey [28]. This fits into our intuition that adjusting 
for confounding should decrease the number of significant genes. We next calculated 
the double-shrinkage estimators for the t-statistics using the method in Section 3.1. 
We then combined an FDR-thresholding procedure with the inference procedures 
described in the paper. This was done in the following manner. We applied the 
q-value procedure of Storey [28] and selected the genes with the twenty smallest q- 
values. We reported the effects and associated 95% confidence intervals in Table 3. 
Note that the 95% CIs should be conservative in that we have ignored the selection 
process (i.e. taken the twenty genes with the smallest q-values). 

6. Discussion 

In this article, we have addressed the issue of multiple testing procedures in the 
presence of confounding. First, we sought to unify proposals from the statistical 
and genetics literature on the problem. Doing so clarifies and provides a stronger 
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-5 5 

delhat= -0.315 sighat= 1.8 pO= 0.879 

Fig 2. Results of the t-test statistics from the prostate cancer gene expression data using the local 
false discovery rate estimation procedure of Efron [11] ■ 



Table 3 

Results for top 20 genes for prostate cancer gene expression data 



Gene name 


Estimator S 


P 


95% CI 


Transforming growth factor, beta 1 


6.99 


0.30 


(0.09,0.52) 


Fcr-l-likc 3, myofcrlin (C. elcgans) 


8.36 


0.55 


(0.14,0.93) 


O-linked N-acotylglucosamine transferase 


-9.11 


-0.51 


(-0.88, -0.18) 


Human calmodulin-I (CALMl) mRNA 


6.87 


0.56 


(0.17,0.98) 


Hepsin (transmembrane protease, serine 1) 


-6.89 


-0.95 


(-1.62, -0.30) 


Caveolin 2 


7.44 


0.53 


(0.16,0.85) 


Ras homolog gene family, member B 


6.7 


1.41 


(0.42,2.40) 


Zinc finger protein 36, C3H type-like 1 


6.76 


0.52 


(0.16,0.87) 


Hepatic leukemia factor 


6.72 


0.46 


(0.11,0.78) 


Phospliatidic acid phosphatase type 2B 


6.59 


0.34 


(0.08, 0.60) 


Multiple endocrine neoplasia I 


-6.51 


-0.31 


(-0.54, -0.08) 


Endothelin receptor type A 


6.47 


0.51 


(0.12,0.92) 


Transforming growth factor, beta receptor HI 


6.38 


0.54 


(0.10,0.98) 


Growth factor receptor-bound protein 2 


-6.31 


-0.42 


(-0.77, -0.12) 


Hermansky-Pudlak syndrome 1 


6.32 


0.34 


(0.07,0.58) 


Dickkopf homolog 3 (Xenopus laevis) 


6.29 


0.34 


(0.09,0.61) 


Phosphatidic acid phosphatase type 2B 


6.24 


0.32 


(0.06,0.57) 


Tissue inhibitor of metalloproteinase 3 


6.2 


0.55 


(0.12, 1.00) 


Hypothetical protein hCLA-iso 


-6.13 


-0.38 


(-0.70, -0.09) 


Zinc finger protein 36, C3H type-like 2 


6.08 


0.34 


(0.08, 0.62) 
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justification for preferring the empirical null distribution that Efron [11] has recently 
advocated. It is also clear from the discussion that what the methods of Devlin 
and Roeder [10] and Pritchard and Rosenberg [23] provide are adjustments to the 
theoretical null, using supplemental data and under various assumptions on the 
nature of confounding and its effects on location and scale parameters. 

Second, we have proposed two analytical approaches. The first is a sensitivity 
analysis approach using a methodology described by Lin et al. [19], which is then 
plugged into standard FDR methodology. As shown in the article, this can be 
viewed as a mean-shift adjustment to the theoretical null. The second procedure 
in the article we pursue is one using Empirical Bayes methodology. This involves 
utilizing the mixture model for hypothesis testing in the presence of confounding 
proposed by Efron [1 ) ] and constructing James-Stein estimators of the test statistic 
that shrinkage towards each target (that specified under the null and that under 
the alternative) with data-dependent weights. What is novel here is that the mean 
value of the distribution of the test statistic under the null is also estimated as part 
of the procedure. This leads to calibration of the test statistics under the empirical 
null hypothesis (Efron [11]) rather than the theoretical null hypothesis. We also 
develop confidence intervals in the multiple testing setting, which has not been 
discussed very much in the literature, with the major exception being the work on 
FDR-controlling confidence intervals by Benjamini and Yekutieli [■')]. 

The methodology proposed here is fairly general. The sensitivity analysis regres- 
sion models considered here are a subset of those considered by Lin et al. [19]; 
the methodology proposed here could apply more generally to censored outcomes 
through proportional hazards regression models and count responses using Pois- 
son regression models. The double shrinkage estimation methodology in Section 4 
requires having Wald-type estimators. 

While there have been Empirical Bayes methods for multiple testing proposed 
in the literature (Efron et al. [12], Datta and Datta [9] and Ghosh [Id]), most of 
this work has focused on hypothesis testing and selection of hypotheses. While this 
is useful in screening problems, it might also be of interest to report estimated 
effects and confidence intervals corresponding to the rejected null hypotheses. A 
major potential advantage of Empirical Bayes (and more generally, fully Bayesian) 
methods in this setting is that parameter estimates and confidence intervals will 
be unaffected by the selection of which hypotheses to reject, provided the testing 
procedure is Bayesian. This is an area that is currently under investigation. 

Acknowledgments. The author would like to thank Tom Nichols and Trivellore 
Raghunathan for helpful discussions. 
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